Nextflow for Synthea EHR Data

Author

Joe McGirr | Bioinformatics Scientist

Published

February 7, 2026

Summary

This notebook describes a mini-project I put together to practice the basics of using Nextflow and explore simulated EHR data.

Synthea is an open-source synthetic health record generator that produces realistic patient-level clinical data (demographics, hospital encounters, conditions, medications, procedures). It was developed by the non-profit MITRE and is commonly used for pipeline prototyping.

In this workflow, I combine Synthea with a Python parsing script in a simple, reproducible Nextflow pipeline. Synthea runs with user-defined inputs (patient count, random seed, location) and produces raw CSVs of simulated patient information. A Python ETL step then standardizes and aggregates the Synthea outputs into a tidy patient-level feature table. The final output of the Nextflow pipeline is a single Parquet table that could be handed off to a Shiny app. In this Quarto notebook, I load an example Parquet file into R and explore the output with a few visualizations.

Synthea

A Java executable runs the simulation engine and generates synthetic electronic health record data. You can specify parameters like the number of patients to generate, the location (city/state), and a random seed for reproducibility.

Download

wget https://github.com/synthetichealth/synthea/releases/download/master-branch-latest/synthea-with-dependencies.jar

Example run

I modified the synthea.properties file to set the output format to CSV.
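The relevant change looks roughly like the excerpt below (the `exporter.csv.export` key is Synthea's documented switch for CSV output; treat the surrounding values as assumptions about my local setup):

```
# synthea.properties (excerpt) -- enable CSV export
exporter.csv.export = true
exporter.fhir.export = false
```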

# Generate 10,000 synthetic patients from Denver, Colorado

java -jar /path/to/synthea-with-dependencies.jar \
-c /path/to/synthea/synthea.properties \
-p 10000 \
-s 18 \
Colorado Denver

Parsing Synthea CSVs

I created a Python script to parse and standardize a subset of the Synthea CSV outputs into a patient-level feature table. The script performs the following steps:

  1. Reads the relevant CSV files (patients, encounters, conditions, medications).
  2. Cleans and standardizes column names.
  3. Calculates patient age from birthdate and filters patients based on optional age criteria.
  4. Counts the number of conditions, medications, hospital encounters, and ER admissions for each patient in the last 10 years.
  5. Generates binary flags for specific conditions (diabetes, hypertension, asthma) based on SNOMED codes.
  6. Outputs the final standardized table as a parquet file for hand-off to R/Shiny.
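Step 2 leans on pyjanitor's `clean_names()`, which lowercases and snake-cases headers. Since Synthea's CSV headers are simply all-caps, the effect here is just lowercasing, which plain pandas can mimic (a minimal illustration, not part of the pipeline):

```python
import pandas as pd

# Synthea-style all-caps headers
raw = pd.DataFrame(columns=["Id", "BIRTHDATE", "DEATHDATE"])

# equivalent of clean_names() for this simple case: lowercase every column
tidy = raw.rename(columns=str.lower)
print(list(tidy.columns))  # → ['id', 'birthdate', 'deathdate']
```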

parse_synthea.py

import argparse
from pathlib import Path
import pandas as pd
import janitor


def load_csvs(input_dir):
    p = Path(input_dir)
    files = {}
    for name in ["patients", "encounters", "conditions", "medications"]:
        fp = p / f"{name}.csv"
        files[name] = pd.read_csv(fp).clean_names() if fp.exists() else pd.DataFrame()
    return files


def filter_last_n_years(df, date_col, cutoff_date):
    if df.empty or date_col not in df.columns:
        return df

    df = df.copy()

    df[date_col] = pd.to_datetime(
        df[date_col], errors="coerce", utc=True
    ).dt.tz_convert(None)

    return df.loc[df[date_col] >= cutoff_date]


def compute_features(files, age_min=None, age_max=None):
    patients = files["patients"].copy()
    patients["birthdate"] = pd.to_datetime(patients["birthdate"])
    # approximate age in whole years (integer division by 365 ignores leap days)
    patients["age"] = (
        (pd.Timestamp.today() - patients["birthdate"]).dt.days // 365
    ).astype(int)

    if age_min is not None:
        patients = patients[patients["age"] >= age_min]
    if age_max is not None:
        patients = patients[patients["age"] <= age_max]

    # count conditions, meds, encounters in the last 10 years
    cutoff_date = (pd.Timestamp.utcnow() - pd.DateOffset(years=10)).tz_localize(None)
    cond = files["conditions"].copy()
    enc = files["encounters"].copy()
    meds = files["medications"].copy()

    cond = filter_last_n_years(cond, "start", cutoff_date)
    enc = filter_last_n_years(enc, "start", cutoff_date)
    meds = filter_last_n_years(meds, "start", cutoff_date)

    cond_counts = cond.groupby("patient").size().rename("n_conditions")
    med_counts = meds.groupby("patient").size().rename("n_medications")
    enc_counts = enc.groupby("patient").size().rename("n_encounters")

    er_counts = (
        enc.loc[enc["description"].str.contains("Emergency room admission", na=False)]
        .groupby("patient")
        .size()
        .rename("n_er_admissions")
    )

    df = patients.set_index("id")[["age", "gender", "race"]].copy()
    df = (
        df.join(cond_counts, how="left")
        .join(med_counts, how="left")
        .join(enc_counts, how="left")
        .join(er_counts, how="left")
    )
    df["n_conditions"] = df["n_conditions"].fillna(0).astype(int)
    df["n_medications"] = df["n_medications"].fillna(0).astype(int)
    df["n_encounters"] = df["n_encounters"].fillna(0).astype(int)
    df["n_er_admissions"] = df["n_er_admissions"].fillna(0).astype(int)

    condition_code_dict = {
        "has_diabetes": {"44054006"},
        "has_hypertension": {"59621000"},
        "has_asthma": {"195967001"},
    }

    if not cond.empty and "code" in cond.columns:
        cond_codes = cond["code"].astype(str)

        for flag_name, codes in condition_code_dict.items():
            patients_with_condition = cond.loc[cond_codes.isin(codes), "patient"]
            df[flag_name] = df.index.isin(patients_with_condition).astype(int)
    else:
        for flag_name in condition_code_dict:
            df[flag_name] = 0

    df = df.reset_index().rename(columns={"id": "patient_id"})
    return df


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-dir", required=True)
    parser.add_argument("--out", required=True)
    parser.add_argument("--age-min", type=int, default=None)
    parser.add_argument("--age-max", type=int, default=None)
    args = parser.parse_args()

    files = load_csvs(args.input_dir)
    table = compute_features(files, age_min=args.age_min, age_max=args.age_max)

    outp = Path(args.out)
    if outp.suffix == ".parquet":
        table.to_parquet(outp, index=False)
    else:
        table.to_csv(outp, index=False)


if __name__ == "__main__":
    main()

Example run

python3 parse_synthea.py \
--input-dir /path/to/synthea/output/csv \
--out standardized_table.parquet \
--age-min 18 \
--age-max 120

Nextflow

I created a simple pipeline to combine the Synthea data generation and parsing steps.

Download

Nextflow was just as easy to download and install as Synthea; it just needs Java 17+. The documentation is a thing of beauty.

curl -s https://get.nextflow.io | bash

main.nf

nextflow.enable.dsl=2

// Defaults; override on the command line (e.g. --patients 2000 --city Denver)
params.patients = 10000
params.state    = "Colorado"
params.city     = "Denver"
params.seed     = 18
params.outdir   = "results"

process generate_synthea {
  tag { "${params.city}-${params.patients}" }
  publishDir "${params.outdir}/synthea/${params.city}-${params.patients}", mode: 'copy'

  output:
    path "output/csv"

  script:
  // Synthea's positional arguments are "State [City]"
  """
  mkdir -p output
  java -jar /home/jmcgir/scratch/synthea/synthea-with-dependencies.jar \\
    -c /home/jmcgir/scratch/synthea/make_csv/synthea.properties \\
    -p ${params.patients} \\
    -s ${params.seed} \\
    --output=./output \\
    ${params.state} ${params.city}
  """
}

process parse_and_standardize {
  tag "parse"
  publishDir "${params.outdir}/table", mode: 'copy'

  input:
    path synthea_dir

  output:
    path "standardized_table.parquet"

  script:
  """
  # pass the staged directory (synthea_dir) to the python ETL
  python3 /home/jmcgir/scratch/nextflow/scripts/parse_synthea.py \\
    --input-dir ${synthea_dir} \\
    --out standardized_table.parquet
  """
}

workflow {
  synthea_ch = generate_synthea()
  parse_and_standardize(synthea_ch)
}

Example run

nextflow run main.nf --patients 2000 --state Colorado --city Denver --seed 18 --outdir results
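The defaults could also live in a `nextflow.config` next to `main.nf` instead of being hard-coded in the script. A sketch, with illustrative values (the memory figure is an assumption, not a measured requirement):

```groovy
// Hypothetical nextflow.config kept alongside main.nf
params {
    patients = 10000
    city     = "Denver"
    seed     = 18
    outdir   = "results"
}

process {
    // give Synthea's JVM some headroom at 10k patients (assumed figure)
    withName: generate_synthea {
        memory = '4 GB'
    }
}
```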

Exploring the data

R setup

Code
tidy_packages <- c("readr", "tidyr", "stringr", "ggplot2", "dplyr")

required_packages <- c(
  "arrow",
  "plotly",
  "gt",
  tidy_packages
)

invisible(lapply(required_packages, library, character.only = TRUE))

# Wong, B. Points of view: Color blindness. Nat Methods (2011).
bla <- '#000000'
blu <- '#0072b2'
grb <- '#56b4e9'
lir <- '#cc79a7'
gre <- '#009e73'
red <- '#d55e00'
org <- '#e69f00'
yel <- '#f0e442'
gry <- '#BBBBBB'

jam_cols <- c(blu, red, gre, org, grb, lir, gry, bla)
jam_shapes <- c(21, 22, 23, 24, 25)

options(ggplot2.discrete.colour = jam_cols)
options(ggplot2.discrete.fill = jam_cols)

main_theme <- theme(
  text = element_text(size = 14),
  axis.text = element_text(size = 12),
  axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)),
  axis.title.x = element_text(margin = margin(t = 10, r = 0, b = 0, l = 0))
)
jam_theme <- theme_minimal() + main_theme
jam_theme_bw <- theme_bw() + main_theme

remove.boxplot.outliers <- function(fig) {
  stopifnot("plotly" %in% class(fig))
  fig$x$data <- lapply(
    fig$x$data,
    \(i) {
      if (i$type != "box") {
        return(i)
      }
      i$marker = list(opacity = 0)
      i$hoverinfo = "none"
      i
    }
  )
  fig
}

EDA plots

synth_table <- arrow::read_parquet(
  "//vast.affymetrix.com/shares/home-dir/jmcgir/scratch/nextflow/results/table/standardized_table.parquet"
)

head(synth_table) |>
  gt() |>
  tab_header(title = "First 6 rows of standardized_table.parquet")
First 6 rows of standardized_table.parquet

| patient_id | age | gender | race | n_conditions | n_medications | n_encounters | n_er_admissions | has_diabetes | has_hypertension | has_asthma |
|---|---|---|---|---|---|---|---|---|---|---|
| 5bab1293-644c-3564-23b6-bd095720101a | 1 | F | white | 6 | 1 | 7 | 0 | 0 | 0 | 0 |
| 5f3f9386-383f-d368-4656-32abba73047c | 1 | M | native | 6 | 2 | 11 | 0 | 0 | 0 | 0 |
| 35f037eb-435b-ebf4-7c1a-5c362ee0e670 | 2 | M | white | 6 | 0 | 10 | 0 | 0 | 0 | 0 |
| 712f82b7-745a-7589-cb2a-81d2aef0c8de | 2 | F | white | 5 | 1 | 11 | 0 | 0 | 0 | 0 |
| a7ab6b2d-0a9c-39d6-dde5-f6df46d1c22d | 3 | M | white | 7 | 0 | 12 | 0 | 0 | 0 | 0 |
| cb1a2337-7b1e-72ff-b37e-d455e0ded5f7 | 0 | F | white | 2 | 0 | 3 | 0 | 0 | 0 | 0 |

n_patients <- nrow(synth_table)

p1 <- ggplot(synth_table, aes(age, fill = gender)) +
  geom_histogram(bins = 30, color = bla) +
  facet_wrap(~gender, nrow = 2) +
  jam_theme_bw +
  labs(
    title = paste0("Age distribution of ", n_patients, " synthetic patients"),
    x = "Age (years)",
    y = "Number of patients"
  )
p1

d1 <- synth_table |>
  select(
    patient_id,
    n_medications,
    has_diabetes,
    has_hypertension,
    has_asthma,
    gender,
    age
  ) |>
  pivot_longer(
    cols = !c(patient_id, n_medications, gender, age),
    names_to = "diagnosis",
    values_to = "value"
  ) |>
  filter(value == 1)

p1 <- ggplot() +
  geom_boxplot(data = d1, aes(diagnosis, n_medications), outlier.shape = NA) +
  geom_jitter(
    data = d1,
    aes(diagnosis, n_medications, label = patient_id, label2 = age),
    alpha = 0.9,
    height = 0,
    width = 0.1,
    shape = 1
  ) +
  jam_theme_bw +
  labs(
    title = "Medications prescribed for patients diagnosed in the past 10 years with:\nasthma, diabetes, and/or hypertension",
    y = "Number of medications"
  )
remove.boxplot.outliers(ggplotly(p1))
p1 <- d1 |>
  filter(diagnosis %in% c("has_diabetes", "has_hypertension")) |>
  ggplot(aes(age, n_medications, color = diagnosis)) +
  geom_point(shape = 1, size = 2) +
  facet_wrap(~diagnosis, scales = "free") +
  geom_smooth(method = "lm") +
  jam_theme_bw +
  labs(
    title = "Age vs. number of medications prescribed in the past 10 years",
    x = "Age (years)",
    y = "Number of medications"
  )
p1

pca_features <- c(
  "age",
  "n_medications",
  "n_encounters",
  "n_er_admissions"
)

metadata_cols <- intersect(
  c("patient_id", "gender", "race", "has_diabetes", "age", "n_encounters"),
  names(synth_table)
)

pca_table <- synth_table |>
  select(all_of(c("patient_id", pca_features))) |>
  mutate(across(all_of(pca_features), as.numeric)) |>
  drop_na() |>
  tibble::column_to_rownames("patient_id")

head(pca_table) |>
  gt() |>
  tab_header(title = "First 6 rows of patient features used for PCA")
First 6 rows of patient features used for PCA

| age | n_medications | n_encounters | n_er_admissions |
|---|---|---|---|
| 1 | 1 | 7 | 0 |
| 1 | 2 | 11 | 0 |
| 2 | 0 | 10 | 0 |
| 2 | 1 | 11 | 0 |
| 3 | 0 | 12 | 0 |
| 0 | 0 | 3 | 0 |

pca_fit <- prcomp(pca_table, scale. = TRUE, rank. = 20)

pc <- as.data.frame(pca_fit$x)
pc$patient_id <- row.names(pc)


patient_metadata <- synth_table |>
  select(all_of(metadata_cols)) |>
  distinct()

pc <- inner_join(pc, patient_metadata, by = "patient_id")

var_exp <- round(summary(pca_fit)$importance[2, ], 4) * 100

plot.pca <- function(column_to_color_by, pc_x, pc_y) {
  pc_x_lab <- paste0("PC", pc_x)
  pc_y_lab <- paste0("PC", pc_y)

  p1 <- pc |>
    ggplot(aes(
      x = !!sym(pc_x_lab),
      y = !!sym(pc_y_lab),
      color = !!sym(column_to_color_by),
      text = patient_id
    )) +
    geom_point(size = 3) +
    theme_minimal(base_size = 14) +
    xlab(paste0("PC", pc_x, " (", var_exp[pc_x], "%)")) +
    ylab(paste0("PC", pc_y, " (", var_exp[pc_y], "%)"))

  return(p1)
}


ggplotly(plot.pca("age", 1, 2))
ggplotly(plot.pca("n_encounters", 1, 2))
loadings <- as.data.frame(pca_fit$rotation)
loadings$feature <- row.names(loadings)

loadings |>
  select(feature, PC1, PC2) |>
  arrange(desc(abs(PC1))) |>
  gt() |>
  tab_header(title = "PCA Loadings for PC1 and PC2")
PCA Loadings for PC1 and PC2

| feature | PC1 | PC2 |
|---|---|---|
| n_medications | 0.6068491 | 0.08977977 |
| n_encounters | 0.5794060 | 0.16197231 |
| n_er_admissions | 0.4753720 | 0.22105757 |
| age | 0.2646587 | -0.95751664 |

loading_vecs <- loadings |>
  mutate(
    PC1 = PC1 * max(abs(pc$PC1)),
    PC2 = PC2 * max(abs(pc$PC2))
  )

p1 <- ggplot(pc, aes(PC1, PC2)) +
  geom_point(alpha = 0.4) +
  geom_segment(
    data = loading_vecs,
    aes(x = 0, y = 0, xend = PC1, yend = PC2),
    arrow = arrow(length = unit(0.2, "cm")),
    color = "black"
  ) +
  geom_text(
    data = loading_vecs,
    aes(x = PC1, y = PC2, label = feature),
    vjust = 1.2
  ) +
  jam_theme +
  xlab(paste0("PC", 1, " (", var_exp[1], "%)")) +
  ylab(paste0("PC", 2, " (", var_exp[2], "%)"))
p1

Notes

Run time

Sys.time() - start_time
Time difference of 5.528315 secs

Session

sessionInfo()
R version 4.5.0 (2025-04-11 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.4   stringr_1.5.1 tidyr_1.3.1   readr_2.1.5   gt_1.0.0     
[6] plotly_4.11.0 ggplot2_4.0.0 arrow_20.0.0 

loaded via a namespace (and not attached):
 [1] sass_0.4.10        generics_0.1.4     xml2_1.3.8         stringi_1.8.7     
 [5] lattice_0.22-7     hms_1.1.3          digest_0.6.37      magrittr_2.0.3    
 [9] evaluate_1.0.3     grid_4.5.0         RColorBrewer_1.1-3 fastmap_1.2.0     
[13] jsonlite_2.0.0     Matrix_1.7-3       mgcv_1.9-3         httr_1.4.7        
[17] purrr_1.0.4        crosstalk_1.2.1    viridisLite_0.4.2  scales_1.4.0      
[21] lazyeval_0.2.2     cli_3.6.5          rlang_1.1.6        splines_4.5.0     
[25] bit64_4.6.0-1      withr_3.0.2        yaml_2.3.10        tools_4.5.0       
[29] tzdb_0.5.0         reticulate_1.42.0  assertthat_0.2.1   vctrs_0.6.5       
[33] R6_2.6.1           png_0.1-8          lifecycle_1.0.4    htmlwidgets_1.6.4 
[37] bit_4.6.0          pkgconfig_2.0.3    pillar_1.10.2      gtable_0.3.6      
[41] glue_1.8.0         data.table_1.17.2  Rcpp_1.0.14        xfun_0.52         
[45] tibble_3.2.1       tidyselect_1.2.1   knitr_1.50         dichromat_2.0-0.1 
[49] farver_2.1.2       nlme_3.1-168       htmltools_0.5.8.1  labeling_0.4.3    
[53] rmarkdown_2.29     compiler_4.5.0     S7_0.2.0